
Add .tar.gz / .tgz archive support for extension, preset, and workflow installation#2395

Open
Copilot wants to merge 23 commits into main from copilot/add-tar-gz-support

Contributor

Copilot AI commented Apr 28, 2026

The extension, preset, and workflow download pipelines only accepted ZIP archives, blocking use of npm registries and CI artifact stores that serve tarballs natively.

Core utilities (extensions.py)

  • detect_archive_format(url, content_type="") — infers format from URL path extension (.zip, .tar.gz, .tgz) with Content-Type header fallback (application/gzip, application/x-gzip, application/x-tar+gzip)
  • safe_extract_tarball(archive_path, dest_dir, error_class) — safe extraction with:
    • Pre-extraction validation of all members: rejects absolute paths, .. traversal, symlinks, hard links, devices, and FIFOs
    • PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata-only entries (GNUTYPE_LONGNAME, GNUTYPE_LONGLINK) are silently skipped — they carry no extractable payload and are emitted by many common archiving tools; GNUTYPE_SPARSE is intentionally not skipped because sparse entries carry a real file payload and isreg() returns True for them
    • All Python versions: passes pre-validated safe_members list to extractall() to ensure only vetted entries are extracted
    • Python 3.12+: additionally uses tarfile.data_filter for extra OS-level protection, combined with members=safe_members
    • tarfile.TarError/OSError are caught and re-raised as the caller-supplied error_class for consistent error handling
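
The format detection described in the first bullet can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the return values `"zip"`, `"tar.gz"`, and the empty string for "unknown" are assumptions, as is the `_TARBALL_CONTENT_TYPES` name.

```python
from urllib.parse import urlparse

# Content types the PR description lists as gzip-tarball hints.
_TARBALL_CONTENT_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/x-tar+gzip",
}


def detect_archive_format(url: str, content_type: str = "") -> str:
    """Return "zip", "tar.gz", or "" when the format cannot be inferred.

    The return values are assumed for illustration only.
    """
    # Look only at the URL path so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return "zip"
    if path.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    # Fall back to the media type, dropping parameters such as "; charset=...".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in _TARBALL_CONTENT_TYPES:
        return "tar.gz"
    return ""
```

Returning a falsy value for unknown input lets callers write `if not archive_fmt:` and decide for themselves whether to try another hint or reject the download.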

Both helpers are public (no underscore prefix) and imported directly by presets.py and __init__.py.
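
A rough sketch of the validate-then-extract flow listed for `safe_extract_tarball` might look like this. It assumes gzip-compressed input (`"r:gz"`), a default `error_class` of `ValueError`, and hypothetical error messages; the real helper may differ in all three. It also assumes `error_class` is not itself a subclass of `OSError` or `tarfile.TarError`, otherwise the validation errors would be caught and wrapped a second time.

```python
import sys
import tarfile
from pathlib import PurePosixPath

# Metadata-only member types that carry no extractable payload.
_METADATA_TYPES = {
    tarfile.XHDTYPE,
    tarfile.XGLTYPE,
    tarfile.SOLARIS_XHDTYPE,
    tarfile.GNUTYPE_LONGNAME,
    tarfile.GNUTYPE_LONGLINK,
}


def safe_extract_tarball(archive_path, dest_dir, error_class=ValueError):
    """Validate every member first, then extract only the vetted ones."""
    try:
        with tarfile.open(archive_path, "r:gz") as tf:
            safe_members = []
            for member in tf.getmembers():
                if member.type in _METADATA_TYPES:
                    continue  # skipped silently and never extracted
                name = PurePosixPath(member.name)
                if name.is_absolute() or ".." in name.parts:
                    raise error_class(f"Unsafe path in tarball: {member.name}")
                if not (member.isreg() or member.isdir()):
                    # Covers symlinks, hard links, devices, and FIFOs.
                    raise error_class(f"Unsupported entry in tarball: {member.name}")
                safe_members.append(member)
            if sys.version_info >= (3, 12):
                # Extra protection from the stdlib "data" filter on 3.12+.
                tf.extractall(dest_dir, members=safe_members, filter="data")
            else:
                tf.extractall(dest_dir, members=safe_members)
    except (tarfile.TarError, OSError) as exc:
        raise error_class(f"Failed to extract tarball: {exc}") from exc
```

Because nothing is written until the whole member list has been checked, a single bad entry rejects the archive without leaving a partial extraction behind.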

Extensions & presets

  • install_from_zip() on both managers now detects archive format from the file extension and dispatches to ZIP or tarball extraction accordingly — existing callers are unaffected
  • download_extension() / download_pack() capture response.geturl() as the canonical post-redirect URL and detect archive format from the final URL first (with Content-Type fallback), only consulting the original download_url as a last hint. This ensures redirects that change the archive extension or type are honored when choosing the on-disk suffix and dispatching to the correct extractor. Unknown formats are rejected with a clear error rather than silently defaulting to ZIP, and the final URL's scheme is re-validated to guard against scheme-downgrade via redirects.
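
The extension-based dispatch in `install_from_zip()` could be approximated as below. Both function names are hypothetical, and the sketch only lists member names instead of extracting, to keep the dispatch itself in focus.

```python
import tarfile
import zipfile
from pathlib import Path


def archive_kind(path: Path) -> str:
    """Classify a local archive by file name, case-insensitively."""
    name = path.name.lower()
    if name.endswith(".zip"):
        return "zip"
    if name.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    # Unknown formats are rejected rather than treated as ZIP.
    raise ValueError(f"Unsupported archive format: {path.name}")


def list_archive_members(path: Path) -> list[str]:
    """Dispatch to the ZIP or tarball reader based on the file name."""
    if archive_kind(path) == "zip":
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    with tarfile.open(path, "r:gz") as tf:
        return tf.getnames()
```

Existing callers that pass a `.zip` path take the same branch as before, which is consistent with the claim that they are unaffected.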

__init__.py call sites

| Path | Change |
| --- | --- |
| extension add --from | Detects format from post-redirect final_url first (Content-Type fallback), then original URL as last hint; unknown formats rejected; HTTPS re-checked on post-redirect URL |
| preset add --from | Same |
| extension update | Inline manifest peek handles both ZIP and tar.gz; cache filename sanitized via Path(extension).name to prevent path traversal |
| workflow add (URL) | Extracts workflow.yml from archive when URL points to one; temp-file paths initialized before write to avoid UnboundLocalError on disk-full |
| workflow add (local) | Accepts local .tar.gz/.tgz/.zip archive files (case-insensitive detection) |
| workflow add (catalog) | Same archive detection for catalog-sourced URLs |
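
The cache-filename sanitization noted for `extension update` can be illustrated with a small helper. The function name, its parameters, and the explicit rejection of empty or `..` names are assumptions added for the sketch; the description only states that `Path(extension).name` is used.

```python
from pathlib import Path


def cache_path_for(cache_dir: Path, extension_id: str, suffix: str) -> Path:
    """Build a cache file path that cannot escape cache_dir."""
    # Path(...).name keeps only the final path component.
    safe_name = Path(extension_id).name
    # "" and ".." survive .name unchanged, so reject them explicitly.
    if safe_name in ("", ".."):
        raise ValueError(f"Invalid extension identifier: {extension_id!r}")
    return cache_dir / f"{safe_name}{suffix}"
```

With this shape, an identifier such as `../../etc/passwd` resolves to a file named `passwd` plus the suffix inside the cache directory instead of a path outside it.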

A shared _extract_workflow_yml(archive_path, fmt) helper handles root-level and single-nested-directory layouts for both formats, with tarfile.extractfile() handles properly closed via context managers.
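
One way such a helper could be written is sketched below. It is not the PR's implementation: the public name `extract_workflow_yml`, the `bytes`-or-`None` return type, and the `"zip"` / `"tar.gz"` format strings are assumptions, and it accepts `workflow.yml` under any single top-level directory rather than checking that the archive has exactly one.

```python
import tarfile
import zipfile
from pathlib import PurePosixPath
from typing import Optional


def _is_workflow_yml(name: str) -> bool:
    """True for workflow.yml at the root or inside one top-level directory."""
    parts = PurePosixPath(name).parts
    return len(parts) in (1, 2) and parts[-1] == "workflow.yml"


def extract_workflow_yml(archive_path: str, fmt: str) -> Optional[bytes]:
    """Return the bytes of workflow.yml, or None when the archive has none."""
    if fmt == "zip":
        with zipfile.ZipFile(archive_path) as zf:
            for name in zf.namelist():
                if _is_workflow_yml(name):
                    return zf.read(name)
        return None
    with tarfile.open(archive_path, "r:gz") as tf:
        for member in tf.getmembers():
            if member.isreg() and _is_workflow_yml(member.name):
                # extractfile() returns a file object; close it via `with`.
                with tf.extractfile(member) as handle:
                    return handle.read()
    return None
```

Reading the single file in memory avoids extracting the rest of the archive to disk when only the workflow definition is needed.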

Tests

30 new tests across test_extensions.py, test_presets.py, and test_workflows.py covering:

  • test_extensions.py: format detection (URL + Content-Type), flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_presets.py: flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_workflows.py (TestWorkflowAddArchive, 9 CLI-level tests): local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, URL-based archive download for both ZIP and tar.gz formats

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:04
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:06
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:09
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:12
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:14
Copilot AI changed the title [WIP] Add support for .tar.gz and .tgz archives Add .tar.gz / .tgz archive support for extension, preset, and workflow installation Apr 28, 2026
Copilot AI requested a review from mnriem April 28, 2026 18:16
Comment thread src/specify_cli/__init__.py Fixed
@mnriem mnriem marked this pull request as ready for review May 5, 2026 22:58
@mnriem mnriem requested review from Copilot and removed request for Copilot May 5, 2026 22:58
Comment thread src/specify_cli/__init__.py Fixed
Comment thread src/specify_cli/extensions.py Fixed
Comment thread src/specify_cli/presets.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 19:11
mnriem and others added 2 commits May 8, 2026 14:11
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Contributor

Copilot AI left a comment

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 1

Comment thread src/specify_cli/extensions.py Outdated
# On older versions pass only the pre-validated members so that no
# unvetted entry (added concurrently or via a race) slips through.
if sys.version_info >= (3, 12):
    tf.extractall(dest_dir, filter="data")  # type: ignore[call-arg]
@mnriem
Collaborator

mnriem commented May 11, 2026

@copilot apply changes based on the comments in this thread

Contributor Author

Copilot AI commented May 11, 2026

@copilot apply changes based on the comments in this thread

Fixed in commit — the Python 3.12+ branch now calls tf.extractall(dest_dir, members=safe_members, filter="data"), so the pre-validated member list is respected on all Python versions and metadata-only/rejected entries cannot slip through on 3.12+.

Contributor

Copilot AI left a comment

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 4

Comment thread src/specify_cli/extensions.py Outdated
Comment on lines +2159 to +2170
# Detect archive format from URL; resolve via Content-Type when needed.
# `final_url` may differ from `download_url` if the server redirects.
archive_fmt = detect_archive_format(download_url)

# Download the archive
try:
with self._open_url(download_url, timeout=60) as response:
zip_data = response.read()

zip_path.write_bytes(zip_data)
return zip_path
final_url = response.geturl()
if not archive_fmt:
content_type = response.headers.get("Content-Type", "")
archive_fmt = detect_archive_format(final_url, content_type)
archive_data = response.read()
Comment thread src/specify_cli/presets.py Outdated
Comment on lines +2316 to +2326
# Detect archive format from URL; resolve via Content-Type when needed.
# `final_url` may differ from `download_url` if the server redirects.
archive_fmt = detect_archive_format(download_url)

try:
with self._open_url(download_url, timeout=60) as response:
zip_data = response.read()

zip_path.write_bytes(zip_data)
return zip_path
final_url = response.geturl()
if not archive_fmt:
content_type = response.headers.get("Content-Type", "")
archive_fmt = detect_archive_format(final_url, content_type)
archive_data = response.read()
Comment on lines 2634 to +2643
with tempfile.TemporaryDirectory() as tmpdir:
zip_path = Path(tmpdir) / "preset.zip"
archive_fmt = _det_fmt(from_url)
final_url = from_url
try:
with urllib.request.urlopen(from_url, timeout=60) as response:
zip_path.write_bytes(response.read())
final_url = response.geturl()
if not archive_fmt:
content_type = response.headers.get("Content-Type", "")
archive_fmt = _det_fmt(final_url, content_type)
archive_data = response.read()
Comment thread src/specify_cli/__init__.py Outdated
Comment on lines +3663 to +3672
archive_fmt = detect_archive_format(from_url)
archive_path = None

try:
with urllib.request.urlopen(from_url, timeout=60) as response:
zip_data = response.read()
zip_path.write_bytes(zip_data)
final_url = response.geturl()
if not archive_fmt:
content_type = response.headers.get("Content-Type", "")
archive_fmt = detect_archive_format(final_url, content_type)
archive_data = response.read()
@mnriem
Collaborator

mnriem commented May 12, 2026

@copilot apply changes based on the comments in this thread

Contributor Author

Copilot AI commented May 12, 2026

@copilot apply changes based on the comments in this thread

Applied the change in all 4 URL-download paths (extensions.py download_extension(), presets.py download_pack(), __init__.py preset add --from, __init__.py extension add --from):

Format detection now runs against the post-redirect final_url first (with Content-Type as fallback inside the same call) and only falls back to the original URL as a last hint when the final URL gives no signal. This way a redirect that changes the archive extension or type is honored when choosing the on-disk suffix and dispatching to the correct extractor.
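
The detection order described above can be sketched as a small pure function. The name `resolve_archive_format`, the injected `detect` callable (standing in for `detect_archive_format`), and the `ValueError` on unknown formats are assumptions for illustration.

```python
from typing import Callable


def resolve_archive_format(
    detect: Callable[[str, str], str],
    original_url: str,
    final_url: str,
    content_type: str = "",
) -> str:
    """Pick the archive format, preferring post-redirect information."""
    # 1. Final URL, with the Content-Type header as fallback in the same call.
    fmt = detect(final_url, content_type)
    if not fmt:
        # 2. The URL the caller originally asked for, as a last hint.
        fmt = detect(original_url, "")
    if not fmt:
        # 3. Refuse to guess rather than silently defaulting to ZIP.
        raise ValueError(f"Cannot determine archive format for {final_url}")
    return fmt
```

Under this ordering, a request for a `.zip` URL that redirects to a `.tgz` file is treated as a tarball, which matches the redirect behavior the comment describes.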

Copilot finished work on behalf of mnriem May 12, 2026 13:21
Comment thread src/specify_cli/__init__.py Fixed
# post-redirect URL first (with Content-Type fallback); only use the
# original `download_url` as a last hint if the final URL gives no
# signal.
final_url = download_url
# Determine the archive format from the post-redirect URL first
# (with Content-Type fallback); only use the original `download_url`
# as a last hint if the final URL gives no signal.
final_url = download_url
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 12, 2026 14:57
Contributor

Copilot AI left a comment

Copilot's findings

Comments suppressed due to low confidence (1)

src/specify_cli/__init__.py:3679

  • final_url is checked for HTTPS only after archive_data = response.read(). This means an insecure redirect can still be fetched before you abort. To make the scheme-downgrade guard effective, validate the scheme right after response.geturl() and before reading the response body (or disallow non-HTTPS redirects in the opener).
                    with urllib.request.urlopen(from_url, timeout=60) as response:
                        final_url = response.geturl()
                        content_type = response.headers.get("Content-Type", "")
                        archive_fmt = detect_archive_format(final_url, content_type)
                        if not archive_fmt:
                            archive_fmt = detect_archive_format(from_url)
                        archive_data = response.read()

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

Comment on lines 2166 to +2172
with self._open_url(download_url, timeout=60) as response:
zip_data = response.read()

zip_path.write_bytes(zip_data)
return zip_path
final_url = response.geturl()
content_type = response.headers.get("Content-Type", "")
archive_fmt = detect_archive_format(final_url, content_type)
if not archive_fmt:
archive_fmt = detect_archive_format(download_url)
archive_data = response.read()
Comment on lines 2322 to 2329
with self._open_url(download_url, timeout=60) as response:
zip_data = response.read()

zip_path.write_bytes(zip_data)
return zip_path
final_url = response.geturl()
content_type = response.headers.get("Content-Type", "")
archive_fmt = detect_archive_format(final_url, content_type)
if not archive_fmt:
archive_fmt = detect_archive_format(download_url)
archive_data = response.read()

Comment on lines 2638 to +2646
with urllib.request.urlopen(from_url, timeout=60) as response:
zip_path.write_bytes(response.read())
final_url = response.geturl()
content_type = response.headers.get("Content-Type", "")
# Prefer the post-redirect URL for format detection;
# fall back to the original URL only as a last hint.
archive_fmt = _det_fmt(final_url, content_type)
if not archive_fmt:
archive_fmt = _det_fmt(from_url)
archive_data = response.read()
@mnriem
Collaborator

mnriem commented May 12, 2026

@copilot apply changes based on the comments in this thread

Contributor Author

Copilot AI commented May 12, 2026

@copilot apply changes based on the comments in this thread

Applied in 5d75366: in all four URL-download paths (extensions.py download_extension(), presets.py download_pack(), __init__.py preset add --from, __init__.py extension add --from), the post-redirect scheme re-validation now runs immediately after response.geturl() and before response.read(). A scheme-downgraded redirect now aborts before any body is fetched.
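
The ordering this fix relies on, checking the scheme before touching the body, can be shown with a minimal sketch. `InsecureRedirectError` and `read_if_https` are hypothetical names, and requiring strictly `https` is an assumption; the project's real check may allow other cases. The `response` argument is anything with `geturl()` and `read()`, such as the object returned by `urllib.request.urlopen`.

```python
from urllib.parse import urlparse


class InsecureRedirectError(Exception):
    """Raised when a download ends on a non-HTTPS URL (hypothetical name)."""


def read_if_https(response) -> bytes:
    """Read the body only after the post-redirect URL passes the scheme check."""
    final_url = response.geturl()
    if urlparse(final_url).scheme != "https":
        # Abort before response.read() so the downgraded body is never read.
        raise InsecureRedirectError(f"Redirected to non-HTTPS URL: {final_url}")
    return response.read()
```

Keeping the check and the read in one function makes it harder for a later edit to reorder them by accident.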

Copilot finished work on behalf of mnriem May 12, 2026 16:24
Contributor

Copilot AI left a comment

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 4

archive_fmt = detect_archive_format(final_url)
if not archive_fmt:
    content_type = resp.headers.get("Content-Type", "")
    archive_fmt = detect_archive_format(final_url, content_type)
Comment on lines +5262 to +5266
# Detect archive format from the final URL or Content-Type header.
cat_archive_fmt = detect_archive_format(final_url)
if not cat_archive_fmt:
    cat_ct = response.headers.get("Content-Type", "")
    cat_archive_fmt = detect_archive_format(final_url, cat_ct)
Comment on lines +1366 to +1380
# Extract ZIP safely (prevent Zip Slip attack)
with zipfile.ZipFile(zip_path, 'r') as zf:
    # Validate all paths first before extracting anything
    temp_path_resolved = temp_path.resolve()
    for member in zf.namelist():
        member_path = (temp_path / member).resolve()
        # Use is_relative_to for safe path containment check
        try:
            member_path.relative_to(temp_path_resolved)
        except ValueError:
            raise ValidationError(
                f"Unsafe path in ZIP archive: {member} (potential path traversal)"
            )
    # Only extract after all paths are validated
    zf.extractall(temp_path)
Comment on lines +1635 to +1646
with zipfile.ZipFile(zip_path, 'r') as zf:
    temp_path_resolved = temp_path.resolve()
    for member in zf.namelist():
        member_path = (temp_path / member).resolve()
        try:
            member_path.relative_to(temp_path_resolved)
        except ValueError:
            raise PresetValidationError(
                f"Unsafe path in ZIP archive: {member} "
                "(potential path traversal)"
            )
    zf.extractall(temp_path)

Development

Successfully merging this pull request may close these issues.

[Enhancement] Support .tar.gz / .tgz archives for extension, preset, and workflow installation

3 participants